2023-07-25
A family of models to analyze the relationship between one outcome and one or more predictors
If I told you that last year’s average exam grade was:
\(\bar{Y} = 6.1\)
What grade would you expect to get for this year’s exam?
If I additionally told you that hours studied is strongly associated with the exam grade
And you know that you studied far more than average
Does that change your expectation for your grade?
This is regression.
If there were NO association, the mean \(\bar{Y}\) would be the best prediction for each student:
This prediction will be a bit wrong for every student:
The mean of those squared prediction errors is the variance of “grade”
Its square root is the SD of grade
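A minimal sketch of this idea in Python. The five grades are made up for illustration (they are not the course data), chosen so that their mean is 6.1:

```python
import math

# Hypothetical grades, made up for illustration (not the course data)
grades = [4.0, 5.5, 6.0, 6.5, 8.5]

# With no predictor, the mean is the prediction for every student
mean_grade = sum(grades) / len(grades)

# Prediction error for each student: observed grade minus the mean
errors = [y - mean_grade for y in grades]

# Variance: the mean of the squared prediction errors
variance = sum(e ** 2 for e in errors) / len(grades)

# Standard deviation: the square root of the variance
sd = math.sqrt(variance)

print(mean_grade, variance, sd)
```

This sketch divides by \(n\), which matches “the mean of the squared errors”. Most statistical software reports the sample variance, which divides by \(n - 1\) instead.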
The points appear to follow an upward diagonal line, rather than the horizontal line of the mean:
The distances of the points from a diagonal line are clearly smaller than their distances from the horizontal line of the mean:
By following the line, you can estimate what grade to expect for a specific number of hours studied. These predictions are better than just using the mean:
As you might remember from high school, a diagonal line is described by:
\(Y = a + bX\)
\(a\) is the intercept, where the line crosses the Y-axis
\(b\) is the slope, how steeply the line increases or decreases
We can use the line to predict values of \(Y\) for individuals \(i\)
We want to obtain the line that gives us the best possible predictions
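One common way to make “best possible predictions” precise is least squares: choose \(a\) and \(b\) so that the sum of squared prediction errors is as small as possible. The sketch below assumes that criterion and uses made-up data (not the course data). The helper name `sum_squared_errors` is hypothetical.

```python
# Hypothetical data, made up for illustration (not the course data)
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
grades = [4.0, 4.5, 5.0, 6.5, 7.0]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(grades) / n

# Least-squares slope: co-variation of X and Y divided by variation of X
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, grades))
sxx = sum((x - mean_x) ** 2 for x in hours)
b = sxy / sxx

# Least-squares intercept: the line passes through (mean_x, mean_y)
a = mean_y - b * mean_x


def sum_squared_errors(predictions, observed):
    """Sum of squared differences between observed and predicted values."""
    return sum((y - p) ** 2 for p, y in zip(predictions, observed))


# Compare predicting the mean for everyone with predicting from the line
sse_mean = sum_squared_errors([mean_y] * n, grades)
sse_line = sum_squared_errors([a + b * x for x in hours], grades)

print(a, b, sse_mean, sse_line)
```

For these made-up numbers the line leaves a much smaller sum of squared errors than the mean does, which is the sense in which its predictions are better.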
Substituting numeric values for the coefficients, the function to predict grade based on hours is:
\(\hat{Y}_i = 2.9 + 0.8*X_i\)
Student 71 studies 4 hours, so the predicted grade \(\hat{Y}_{71}\) is:
\(\hat{Y}_{71} = 2.9 + 0.8 * 4 = 6.1\)
In reality, student 71’s grade was 8.8, so the prediction error was \(Y_{71} - \hat{Y}_{71} = 8.8 - 6.1 = 2.7\)
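The same calculation as a short Python sketch. The coefficients 2.9 and 0.8 and the values for student 71 come from the example. The function name `predict_grade` is made up for illustration.

```python
# Coefficients from the prediction formula in the example
INTERCEPT = 2.9
SLOPE = 0.8


def predict_grade(hours):
    """Predicted grade on the regression line for a number of hours studied."""
    return INTERCEPT + SLOPE * hours


# Student 71 studied 4 hours and obtained an 8.8
observed = 8.8
predicted = predict_grade(4)

# Prediction error: observed grade minus predicted grade
error = observed - predicted

print(round(predicted, 1), round(error, 1))
```

Changing the argument of `predict_grade` gives the predicted grade for any other number of hours. For 0 hours it returns the intercept.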
The formula \(Y = a + bX\) describes the diagonal line
It does not yet describe the prediction error
The linear regression model expands the formula to include prediction error
\(Y_i = a + b*X_i +e_{i}\)
\(e_{i}\) refers to the individual prediction error
| Symbol | Interpretation |
|---|---|
| \(Y_i\) | Individual i’s score on dependent variable Y |
| \(a\) | Coefficient, intercept of the regression line |
| \(b\) | Coefficient, slope of the regression line |
| \(X_i\) | Individual i’s score on independent variable X |
| \(e_i\) | Individual i’s prediction error |
\(Y_i = a + b*X_i +e_{i}\)
In words, this formula says:
“The individual values on variable Y are equal to the intercept, plus the slope times the individual values on the predictor X, plus individual prediction error.”
\(Y_i = a + b*X_i +e_{i}\)
| Symbol | Also known as |
|---|---|
| \(Y_i\) | Outcome, dependent variable (DV) |
| \(a\) | \(b_0, \beta_0\) |
| \(b\) | \(b_1, \beta_1\) |
| \(X_i\) | Predictor, independent variable (IV) |
| \(e_i\) | Residual, \(\epsilon_i\) |
\(Y_i = a + b*X_i +e_{i}\)
“The individual values on variable Y are equal to the intercept, plus the slope times the individual values on the predictor X, plus individual prediction error.”
And also:
“The individual values on variable Y are equal to the predicted values, plus individual prediction errors”
\(Y_i = \hat{Y}_i + e_{i}\)
The predicted value is the value on the regression line:
\(\hat{Y}_i = a + b*X_i\)
You can perform hypothesis tests on the coefficients \(a\) and \(b\)
Remember: hypotheses are statements about the population, so we use symbols for population parameters
We use the t-distribution because we typically don’t know the population variance of \(X\) or \(Y\), so the standard error has to be estimated from the sample
\[ t = \frac{b}{SE_b} \]
Let’s conduct a one-sided hypothesis test, \(H_0: \beta_1 \leq 0\)
Or you might wonder: if I studied 0 hours, should I expect a passing grade? \(H_0: \beta_0 \leq 5.5\)
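A sketch of how both t-statistics could be computed by hand, under stated assumptions. The data are made up (not the course data). The standard-error formulas are the usual ones for simple linear regression with \(n - 2\) degrees of freedom. The general form \(t = (\text{estimate} - \text{value under } H_0) / SE\) reduces to \(b / SE_b\) when the value under \(H_0\) is 0. The function name `t_statistic` is hypothetical.

```python
import math

# Hypothetical data, made up for illustration (not the course data)
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
grades = [4.0, 4.5, 5.0, 6.5, 7.0]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(grades) / n
sxx = sum((x - mean_x) ** 2 for x in hours)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, grades))

# Least-squares estimates of slope and intercept
b = sxy / sxx
a = mean_y - b * mean_x

# Residual variance, with n - 2 degrees of freedom (two estimated coefficients)
residuals = [y - (a + b * x) for x, y in zip(hours, grades)]
df = n - 2
residual_variance = sum(e ** 2 for e in residuals) / df

# Standard errors of the slope and the intercept
se_b = math.sqrt(residual_variance / sxx)
se_a = math.sqrt(residual_variance * (1 / n + mean_x ** 2 / sxx))


def t_statistic(estimate, standard_error, null_value=0.0):
    """t = (estimate - value under H0) / standard error."""
    return (estimate - null_value) / standard_error


t_slope = t_statistic(b, se_b)           # H0: beta_1 <= 0
t_intercept = t_statistic(a, se_a, 5.5)  # H0: beta_0 <= 5.5

print(df, t_slope, t_intercept)
```

Each statistic is compared against a t-distribution with \(n - 2\) degrees of freedom. For a one-sided \(H_0\) of the form \(\beta \leq\) some value, only a large positive \(t\) counts as evidence against \(H_0\).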
The effect of hours studied on exam grade was significant, \(b = 0.78, t(90) = 13.55, p < .001\). This means that for every additional hour studied, the expected grade increased by 0.78 points.
Dictionary definition of “assumption”: “something that you accept as true without question or proof”
Why are residuals normally distributed?
https://www.youtube.com/watch?v=6YDHBFVIvIs&feature=youtu.be&t=6